Code
suppressMessages(reticulate::py_install("scipy"))
suppressMessages(reticulate::py_install("plotly"))
suppressMessages(reticulate::py_install("scikit-learn"))jl3205
This analysis builds on the EDA by uncovering systematic risk disparities, emerging threat patterns, and potential predictive signals. Here, the focus shifts to understanding why these patterns emerge, how they cluster, and what they imply for decision-making in the field.
Are national staff disproportionately affected by violence?
The following analysis compares harm profiles between national and international aid workers across three key outcomes: fatalities, injuries, and abductions.
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import plotly.express as px
df = pd.read_csv("clean_data/security_incidents_cleaned.csv")
def plot_victim_impact(df, national_col, international_col, label):
filtered_df = df[(df[national_col] > 0) | (df[international_col] > 0)]
long_df = pd.DataFrame({
label: pd.concat([
filtered_df[national_col],
filtered_df[international_col]
], ignore_index=True),
"Role": ["Nationals"] * len(filtered_df) + ["Internationals"] * len(filtered_df)
})
t_stat, p_val = ttest_ind(
filtered_df[national_col],
filtered_df[international_col],
equal_var=False
)
plt.figure(figsize=(8, 5))
sns.boxplot(
data=long_df,
x="Role", y=label,
hue="Role", dodge=False,
palette={"Nationals": "#4EA8DE", "Internationals": "#F4A261"}
)
plt.title(label, fontsize=14, weight="bold")
plt.ylabel("Amount")
plt.xlabel("")
plt.tick_params(labelsize=11)
plt.grid(axis="y", linestyle="--", alpha=0.4)
legend = plt.gca().get_legend()
if legend:
legend.remove()
annotation = f"P < 0.0001" if p_val < 0.0001 else f"T = {t_stat:.2f}\nP = {p_val:.4f}"
plt.text(
0.5, plt.ylim()[1]*0.85,
annotation,
ha='center', va='top',
fontsize=14, fontweight='bold', color='white',
bbox=dict(facecolor='black', edgecolor='white',
boxstyle='round,pad=0.6')
)
plt.tight_layout()
plt.show()
plot_victim_impact(df, "Nationals killed",
"Internationals killed", "Fatalities")Figure 8: Fatalities are significantly higher among nationals - stressing the disproportionate frontline exposure in high-threat environments
Figure 9: Injury counts further highlight the disproportionate risk faced by nationals in high-risk deployments.
Across all three types of harm, nationals consistently face significantly higher rates of violence. This structural risk disparity underscores the need to reassess field deployment strategies, training programs, and protective measures specifically tailored to national teams.
Do certain actor types favor particular forms of violence?
Understanding the relationship between perpetrator type and their chosen tactics provides valuable insight into operational risk profiling.
contingency = pd.crosstab(df["Actor type"], df["Means of attack"])
chi2, p, dof, expected = chi2_contingency(contingency)
# print(f"Chi-square statistic: {chi2:.2f}, p-value: {p:.4f}")
plt.figure(figsize=(10, 6))
sns.heatmap(contingency, cmap="Reds", annot=True, fmt="d")
plt.title("Actor Type vs. Means of Attack")
plt.ylabel("Actor Type")
plt.xlabel("Means of Attack")
plt.tight_layout()
plt.show()Figure 11: This matrix reveals clear associations between threat actor types and their preferred methods of attack.
Can we group incidents into clusters that reveal patterns in threat profiles?
Using PCA and KMeans clustering, I reduced the ‘incident’ feature space to reveal emergent threat archetypes. These clusters can help field actors recognize common attack profiles and identify anomalies.
features = df[["UN", "INGO", "ICRC", "NNGO", "Other", "Nationals killed",
"Internationals killed", "Means of attack", "Actor type"]].copy()
features = pd.get_dummies(
features, columns=["Means of attack", "Actor type"], drop_first=True)
X = StandardScaler().fit_transform(features.fillna(0))
kmeans = KMeans(n_clusters=4, random_state=42)
df["Cluster"] = kmeans.fit_predict(X)
pca = PCA(n_components=2)
components = pca.fit_transform(X)
df["PC1"] = components[:, 0]
df["PC2"] = components[:, 1]
fig = px.scatter(df, x="PC1", y="PC2", color="Cluster",
title="Incident Clusters (Threat Archetypes)",
hover_data=["Means of attack", "Actor type"])
fig.show()Figure 12: PCA combined with KMeans clustering reveals distinct archetypes of threat incidents. Most events fall into dense clusters, indicating recurring operational patterns, while a handful of outlier points highlight rare or extreme scenarios—such as isolated, high-casualty attacks. These insights lay the groundwork for risk profiling and the development of early warning systems in humanitarian security planning.
Can we predict whether an attack will be fatal based on key attributes?
Logistic regression is defined as a statistical modeling technique used to predict binary outcomes — such as whether an incident results in a fatality — based on a set of input features.
As proof of concept, I trained a logistic regression model using organizational presence, perpetrator type, and method of attack to predict whether an incident results in fatalities.
Logistic regression was selected for its simplicity and interpretability — key in high-stakes decision environments. The model incorporates class balancing to address skewed fatality rates and ensure sensitivity to rare, high-risk outcomes.
np.random.seed(5300)
df["Fatal"] = (df["Nationals killed"] + df["Internationals killed"]) > 0
features = df[["UN", "INGO", "ICRC", "NNGO", "Means of attack", "Actor type"]]
features = pd.get_dummies(features, drop_first=True).fillna(0)
X = features
y = df["Fatal"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
model = LogisticRegression(
max_iter=1000,
class_weight='balanced',
solver='liblinear',
random_state=42
)
model.fit(X_train, y_train)LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42,
solver='liblinear')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42,
solver='liblinear')y_pred = model.predict(X_test)
report_dict = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report_dict).T.loc[["False", "True"], [
"precision", "recall", "f1-score"]]
report_df.index = ["No Fatality", "Fatality"]
report_df.plot(kind='bar', figsize=(9, 5), color=[
"#4EA8DE", "#F4A261", "#E76F51"])
plt.title("Logistic Regression Performance by Class",
fontsize=14, weight="bold")
plt.ylabel("Score")
_ = plt.ylim(0, 1.05)
_ = plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.4)
_ = plt.legend(title="Metric", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()Figure 13: This proof-of-concept model demonstrates the viability of forecasting fatal attacks using features such as organizational footprint and threat type.